COVID-19 Literature Clustering

Goal¶

Given the rapidly growing amount of literature on COVID-19, it is difficult to keep up with the major research trends being explored on this topic. Can we cluster similar research articles to make it easier for health professionals to find relevant research trends?

Dataset we will be using¶

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.

Cite: COVID-19 Open Research Dataset Challenge (CORD-19) | Kaggle

Loading Data¶

The data loading procedure described below is adapted from the following notebook by Ivan Ega Pratama, from Kaggle Dataset Parsing Code | Kaggle, COVID EDA: Initial Exploration Tool. Before running the procedures below, make sure to unzip the file CORD-19-research-challenge.zip inside the Experiment3 folder.

Loading Metadata¶

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import glob
import json

import matplotlib.pyplot as plt
plt.style.use('ggplot')

Let's load the metadata of the articles. 'title' and 'journal' attributes may be useful later when we cluster the articles to see what kinds of articles cluster together.

In [2]:
root_path = 'CORD-19-research-challenge/'
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()
Out[2]:
cord_uid sha source_x title doi pmcid pubmed_id license abstract publish_time authors journal Microsoft Academic Paper ID WHO #Covidence has_pdf_parse has_pmc_xml_parse full_text_file url
0 xqhn0vbp 1e1286db212100993d03cc22374b624f7caee956 PMC Airborne rhinovirus detection and effect of ul... 10.1186/1471-2458-3-5 PMC140314 12525263 no-cc BACKGROUND: Rhinovirus, the most common cause ... 2003-01-13 Myatt, Theodore A; Johnston, Sebastian L; Rudn... BMC Public Health NaN NaN True True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
1 gi6uaa83 8ae137c8da1607b3a8e4c946c07ca8bda67f88ac PMC Discovering human history from stomach bacteria 10.1186/gb-2003-4-5-213 PMC156578 12734001 no-cc Recent analyses of human pathogens have reveal... 2003-04-28 Disotell, Todd R Genome Biol NaN NaN True True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
2 le0ogx1s NaN PMC A new recruit for the army of the men of death 10.1186/gb-2003-4-7-113 PMC193621 12844350 no-cc The army of the men of death, in John Bunyan's... 2003-06-27 Petsko, Gregory A Genome Biol NaN NaN False True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
3 fy4w7xz8 0104f6ceccf92ae8567a0102f89cbb976969a774 PMC Association of HLA class I with severe acute r... 10.1186/1471-2350-4-9 PMC212558 12969506 no-cc BACKGROUND: The human leukocyte antigen (HLA) ... 2003-09-12 Lin, Marie; Tseng, Hsiang-Kuang; Trejaut, Jean... BMC Med Genet NaN NaN True True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...
4 0qaoam29 5b68a553a7cbbea13472721cd1ad617d42b40c26 PMC A double epidemic model for the SARS propagation 10.1186/1471-2334-3-19 PMC222908 12964944 no-cc BACKGROUND: An epidemic of a Severe Acute Resp... 2003-09-10 Ng, Tuen Wai; Turinici, Gabriel; Danchin, Antoine BMC Infect Dis NaN NaN True True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...
In [3]:
meta_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51078 entries, 0 to 51077
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   cord_uid                     51078 non-null  object
 1   sha                          38022 non-null  object
 2   source_x                     51078 non-null  object
 3   title                        50920 non-null  object
 4   doi                          47741 non-null  object
 5   pmcid                        41082 non-null  object
 6   pubmed_id                    37861 non-null  object
 7   license                      51078 non-null  object
 8   abstract                     42352 non-null  object
 9   publish_time                 51070 non-null  object
 10  authors                      48891 non-null  object
 11  journal                      46368 non-null  object
 12  Microsoft Academic Paper ID  964 non-null    object
 13  WHO #Covidence               1768 non-null   object
 14  has_pdf_parse                51078 non-null  bool  
 15  has_pmc_xml_parse            51078 non-null  bool  
 16  full_text_file               42511 non-null  object
 17  url                          50776 non-null  object
dtypes: bool(2), object(16)
memory usage: 6.3+ MB

Fetch All of JSON File Path¶

Next, we will get the paths to all JSON files:

In [4]:
all_json = glob.glob(f'{root_path}/**/*.json', recursive=True)
len(all_json)
Out[4]:
59311

Helper Functions¶

We need the following helper functions for reading files and adding breaks after every word.

In [5]:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.abstract = []
            self.body_text = []
            # Abstract
            for entry in content['abstract']:
                self.abstract.append(entry['text'])
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'
    
# Helper function adds break after every words when character length reach to certain amount. This is for the interactive plot so that hover tool fits the screen.
    
def get_breaks(content, length):
    data = ""
    words = content.split(' ')
    total_chars = 0

    # add break every length characters
    for i in range(len(words)):
        total_chars += len(words[i])
        if total_chars > length:
            data = data + "<br>" + words[i]
            total_chars = 0
        else:
            data = data + " " + words[i]
    return data

Load the Data into DataFrame¶

Using theses helper functions, let's read in the articles into a DataFrame that can be used easily:

In [6]:
dict_ = {'paper_id': [], 'abstract': [], 'body_text': [], 'authors': [], 'title': [], 'journal': [], 'abstract_summary': []}
for idx, entry in enumerate(all_json):
    try:
        if idx % (len(all_json) // 10) == 0:
            print(f'Processing index: {idx} of {len(all_json)}')
        content = FileReader(entry)

        # get metadata information
        meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]
        # no metadata, skip this paper
        if len(meta_data) == 0:
            continue

        dict_['paper_id'].append(content.paper_id)
        dict_['abstract'].append(content.abstract)
        dict_['body_text'].append(content.body_text)

        # also create a column for the summary of abstract to be used in a plot
        if len(content.abstract) == 0: 
            # no abstract provided
            dict_['abstract_summary'].append("Not provided.")
        elif len(content.abstract.split(' ')) > 100:
            # abstract provided is too long for plot, take first 300 words append with ...
            info = content.abstract.split(' ')[:100]
            summary = get_breaks(' '.join(info), 40)
            dict_['abstract_summary'].append(summary + "...")
        else:
            # abstract is short enough
            summary = get_breaks(content.abstract, 40)
            dict_['abstract_summary'].append(summary)

        # get metadata information
        meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]

        try:
            # if more than one author
            authors = meta_data['authors'].values[0].split(';')
            if len(authors) > 2:
                # more than 2 authors, may be problem when plotting, so take first 2 append with ...
                dict_['authors'].append(". ".join(authors[:2]) + "...")
            else:
                # authors will fit in plot
                dict_['authors'].append(". ".join(authors))
        except Exception as e:
            # if only one author - or Null valie
            dict_['authors'].append(meta_data['authors'].values[0])

        # add the title information, add breaks when needed
        try:
            title = get_breaks(meta_data['title'].values[0], 40)
            dict_['title'].append(title)
        # if title was not provided
        except Exception as e:
            dict_['title'].append(meta_data['title'].values[0])

        # add the journal information
        dict_['journal'].append(meta_data['journal'].values[0])
        
    
    except Exception as e:
        continue
    
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text', 'authors', 'title', 'journal', 'abstract_summary'])
df_covid.head()
Processing index: 0 of 59311
Processing index: 5931 of 59311
Processing index: 11862 of 59311
Processing index: 17793 of 59311
Processing index: 23724 of 59311
Processing index: 29655 of 59311
Processing index: 35586 of 59311
Processing index: 41517 of 59311
Processing index: 47448 of 59311
Processing index: 53379 of 59311
Processing index: 59310 of 59311
Out[6]:
paper_id abstract body_text authors title journal abstract_summary
0 0015023cc06b5362d332b3baf348d11567ca2fbb word count: 194 22 Text word count: 5168 23 24... VP3, and VP0 (which is further processed to VP... Joseph C. Ward. Lidia Lasecka-Dykes... The RNA pseudoknots in foot-and-mouth disease... NaN word count: 194 22 Text word count: 5168 23 2...
1 00340eea543336d54adda18236424de6a5e91c9d During the past three months, a new coronaviru... In December 2019, a novel coronavirus, SARS-Co... Carla Mavian. Simone Marini... Regaining perspective on SARS-CoV-2<br>molecu... NaN During the past three months, a new coronavir...
2 004f0f8bb66cf446678dc13cf2701feec4f36d76 The 2019-nCoV epidemic has spread across China... Hanchu Zhou. Jianan Yang... Healthcare-resource-adjusted<br>vulnerabiliti... NaN Not provided.
3 00911cf4f99a3d5ae5e5b787675646a743574496 The fast accumulation of viral metagenomic dat... Metagenomic sequencing, which allows us to dir... Jiayu Shang. Yanni Sun CHEER: hierarCHical taxonomic<br>classificati... NaN The fast accumulation of viral metagenomic<br...
4 00d16927588fb04d4be0e6b269fc02f0d3c2aa7b Infectious bronchitis (IB) causes significant ... Infectious bronchitis (IB), which is caused by... Salman L. Butt. Eric C. Erwood... Real-time, MinION-based, amplicon<br>sequenci... NaN Infectious bronchitis (IB) causes<br>signific...
In [7]:
dict_ = None

Adding the Word Count Columns¶

We add two extra columns related to the word count of the abstract and body_text, which can be useful features later:

In [8]:
df_covid['abstract_word_count'] = df_covid['abstract'].apply(lambda x: len(x.strip().split()))
df_covid['body_word_count'] = df_covid['body_text'].apply(lambda x: len(x.strip().split()))
df_covid.head()
Out[8]:
paper_id abstract body_text authors title journal abstract_summary abstract_word_count body_word_count
0 0015023cc06b5362d332b3baf348d11567ca2fbb word count: 194 22 Text word count: 5168 23 24... VP3, and VP0 (which is further processed to VP... Joseph C. Ward. Lidia Lasecka-Dykes... The RNA pseudoknots in foot-and-mouth disease... NaN word count: 194 22 Text word count: 5168 23 2... 241 1728
1 00340eea543336d54adda18236424de6a5e91c9d During the past three months, a new coronaviru... In December 2019, a novel coronavirus, SARS-Co... Carla Mavian. Simone Marini... Regaining perspective on SARS-CoV-2<br>molecu... NaN During the past three months, a new coronavir... 175 2549
2 004f0f8bb66cf446678dc13cf2701feec4f36d76 The 2019-nCoV epidemic has spread across China... Hanchu Zhou. Jianan Yang... Healthcare-resource-adjusted<br>vulnerabiliti... NaN Not provided. 0 755
3 00911cf4f99a3d5ae5e5b787675646a743574496 The fast accumulation of viral metagenomic dat... Metagenomic sequencing, which allows us to dir... Jiayu Shang. Yanni Sun CHEER: hierarCHical taxonomic<br>classificati... NaN The fast accumulation of viral metagenomic<br... 139 5188
4 00d16927588fb04d4be0e6b269fc02f0d3c2aa7b Infectious bronchitis (IB) causes significant ... Infectious bronchitis (IB), which is caused by... Salman L. Butt. Eric C. Erwood... Real-time, MinION-based, amplicon<br>sequenci... NaN Infectious bronchitis (IB) causes<br>signific... 1647 4003
In [9]:
df_covid.describe(include='all')
Out[9]:
paper_id abstract body_text authors title journal abstract_summary abstract_word_count body_word_count
count 36009 36009 36009 35413 35973 34277 36009 36009.000000 36009.000000
unique 36009 26249 35981 33538 35652 5410 26239 NaN NaN
top 0015023cc06b5362d332b3baf348d11567ca2fbb In previous reports, workers have characterize... Domingo, Esteban In the Literature PLoS One Not provided. NaN NaN
freq 1 9704 3 14 9 1518 9704 NaN NaN
mean NaN NaN NaN NaN NaN NaN NaN 160.511678 4705.127163
std NaN NaN NaN NaN NaN NaN NaN 168.348075 6944.838042
min NaN NaN NaN NaN NaN NaN NaN 0.000000 1.000000
25% NaN NaN NaN NaN NaN NaN NaN 0.000000 2370.000000
50% NaN NaN NaN NaN NaN NaN NaN 158.000000 3645.000000
75% NaN NaN NaN NaN NaN NaN NaN 235.000000 5450.000000
max NaN NaN NaN NaN NaN NaN NaN 4767.000000 260378.000000

Handling Possible Duplicates¶

When we look at the unique values above, we can see that there are duplicates. It may have caused because of author submiting the article to multiple journals. Let's remove the duplicates from our dataset, and also those articles that are missing abstracts or body_text

In [10]:
df_covid.dropna(inplace=True)
df_covid = df_covid[df_covid.abstract != ''] #Remove rows which are missing abstracts
df_covid = df_covid[df_covid.body_text != ''] #Remove rows which are missing body_text
df_covid.drop_duplicates(['abstract', 'body_text'], inplace=True) # remove duplicate rows having same abstract and body_text
df_covid.describe(include='all')
Out[10]:
paper_id abstract body_text authors title journal abstract_summary abstract_word_count body_word_count
count 24584 24584 24584 24584 24584 24584 24584 24584.000000 24584.000000
unique 24584 24552 24584 23709 24545 3963 24545 NaN NaN
top 00142f93c18b07350be89e96372d240372437ed9 Travel Medicine and Infectious Disease xxx (xx... iNTRODUCTiON Human beings are constantly expos... Woo, Patrick C. Y.. Lau, Susanna K. P.... Respiratory Infections PLoS One Travel Medicine and Infectious Disease xxx<br... NaN NaN
freq 1 5 1 7 3 1514 5 NaN NaN
mean NaN NaN NaN NaN NaN NaN NaN 216.446673 4435.475106
std NaN NaN NaN NaN NaN NaN NaN 137.065117 3657.421423
min NaN NaN NaN NaN NaN NaN NaN 1.000000 23.000000
25% NaN NaN NaN NaN NaN NaN NaN 147.000000 2711.000000
50% NaN NaN NaN NaN NaN NaN NaN 200.000000 3809.500000
75% NaN NaN NaN NaN NaN NaN NaN 255.000000 5431.000000
max NaN NaN NaN NaN NaN NaN NaN 3694.000000 232431.000000
In [11]:
df_covid.head()
Out[11]:
paper_id abstract body_text authors title journal abstract_summary abstract_word_count body_word_count
1625 00142f93c18b07350be89e96372d240372437ed9 Dendritic cells (DCs) are specialized antigen-... iNTRODUCTiON Human beings are constantly expos... Geginat, Jens. Nizzoli, Giulia... Immunity to Pathogens Taught by Specialized<b... Front Immunol Dendritic cells (DCs) are specialized<br>anti... 309 5305
1626 0022796bb2112abd2e6423ba2d57751db06049fb Dengue has a negative impact in low-and lower ... Pathogens and vectors can now be transported r... Viennet, Elvina. Ritchie, Scott A.... Public Health Responses to and Challenges for... PLoS Negl Trop Dis Dengue has a negative impact in low-and lower... 276 7288
1627 0031e47b76374e05a18c266bd1a1140e5eacb54f Fecal microbial transplantation (FMT), a treat... a1111111111 a1111111111 a1111111111 a111111111... McKinney, Caroline A.. Oliveira, Bruno C. M.... The fecal microbiota of healthy donor horses<... PLoS One Fecal microbial transplantation (FMT), a<br>t... 141 4669
1628 00326efcca0852dc6e39dc6b7786267e1bc4f194 Fifteen years ago, United Nations world leader... In addition to preventative care and nutrition... Turner, Erin L.. Nielsen, Katie R.... A Review of Pediatric Critical Care in<br>Res... Front Pediatr Fifteen years ago, United Nations world<br>le... 151 7593
1629 00352a58c8766861effed18a4b079d1683fec2ec Posttranslational modification of proteins by ... Ubiquitination is a widely used posttranslatio... Hodul, Molly. Dahlberg, Caroline L.... Function of the Deubiquitinating Enzyme USP46... Front Synaptic Neurosci Posttranslational modification of proteins<br... 148 3156
In [12]:
df_covid.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24584 entries, 1625 to 36008
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   paper_id             24584 non-null  object
 1   abstract             24584 non-null  object
 2   body_text            24584 non-null  object
 3   authors              24584 non-null  object
 4   title                24584 non-null  object
 5   journal              24584 non-null  object
 6   abstract_summary     24584 non-null  object
 7   abstract_word_count  24584 non-null  int64 
 8   body_word_count      24584 non-null  int64 
dtypes: int64(2), object(7)
memory usage: 1.9+ MB

Pre-processing Data:¶

Let us limit the number of articles to speed up computation:

In [13]:
df_covid = df_covid.head(12500)

Now let's remove punctuation from each text:

In [14]:
import re

df_covid['body_text'] = df_covid['body_text'].apply(lambda x: re.sub('[^a-zA-z0-9\s]','',x))
df_covid['abstract'] = df_covid['abstract'].apply(lambda x: re.sub('[^a-zA-z0-9\s]','',x))

Convert each text to lower case:

In [15]:
def lower_case(input_str):
    input_str = input_str.lower()
    return input_str

df_covid['body_text'] = df_covid['body_text'].apply(lambda x: lower_case(x))
df_covid['abstract'] = df_covid['abstract'].apply(lambda x: lower_case(x))
In [16]:
df_covid.head(4)
Out[16]:
paper_id abstract body_text authors title journal abstract_summary abstract_word_count body_word_count
1625 00142f93c18b07350be89e96372d240372437ed9 dendritic cells dcs are specialized antigenpre... introduction human beings are constantly expos... Geginat, Jens. Nizzoli, Giulia... Immunity to Pathogens Taught by Specialized<b... Front Immunol Dendritic cells (DCs) are specialized<br>anti... 309 5305
1626 0022796bb2112abd2e6423ba2d57751db06049fb dengue has a negative impact in lowand lower m... pathogens and vectors can now be transported r... Viennet, Elvina. Ritchie, Scott A.... Public Health Responses to and Challenges for... PLoS Negl Trop Dis Dengue has a negative impact in low-and lower... 276 7288
1627 0031e47b76374e05a18c266bd1a1140e5eacb54f fecal microbial transplantation fmt a treatmen... a1111111111 a1111111111 a1111111111 a111111111... McKinney, Caroline A.. Oliveira, Bruno C. M.... The fecal microbiota of healthy donor horses<... PLoS One Fecal microbial transplantation (FMT), a<br>t... 141 4669
1628 00326efcca0852dc6e39dc6b7786267e1bc4f194 fifteen years ago united nations world leaders... in addition to preventative care and nutrition... Turner, Erin L.. Nielsen, Katie R.... A Review of Pediatric Critical Care in<br>Res... Front Pediatr Fifteen years ago, United Nations world<br>le... 151 7593

Now that we have the text cleaned up, we can create our features vector which can be fed into a clustering or dimensionality reduction algorithm. For our first try, we will focus on the text on the body of the articles. Let's grab that:

In [17]:
text = df_covid.drop(["paper_id", "abstract", "abstract_word_count", "body_word_count", "authors", "title", "journal", "abstract_summary"], axis=1)
In [18]:
text.head(5)
Out[18]:
body_text
1625 introduction human beings are constantly expos...
1626 pathogens and vectors can now be transported r...
1627 a1111111111 a1111111111 a1111111111 a111111111...
1628 in addition to preventative care and nutrition...
1629 ubiquitination is a widely used posttranslatio...

Let's transform this 1D DataFrame into a 1D list where each index is an article (instance), so that we can work with words from each instance:

In [19]:
text_arr = text.stack().tolist()
len(text_arr)
Out[19]:
12500

Next, let's create a 2D list, where each row is an instance and each column is a word. Meaning, we will separate each instance into words:

In [20]:
words = []
for ii in range(0,len(text)):
    words.append(str(text.iloc[ii]['body_text']).split(" "))
In [21]:
print(words[0][:20])
['introduction', 'human', 'beings', 'are', 'constantly', 'exposed', 'to', 'a', 'myriad', 'of', 'pathogens', 'including', 'bacteria', 'fungi', 'and', 'viruses', 'these', 'foreign', 'invaders', 'or']

What we want now is to create n-grams from the words with n=2 (i.e., 2-grams). A 2-gram is a sequence of two words appearing together (e.g., 'thank you'). The motivation behind using 2-grams is to describe a document using pairs of consecutive words, instead of individual words, so as to capture the co-occurrence information of words at adjacent positions in a document. We have created a 2D list where each row is an instance (or document) and every column is a word. We need to create a similar 2D list where every row is an instance and every column is a 2-gram.

In [22]:
n_gram_all = []

for word in words:
    # get n-grams for the instance
    n_gram = []
    for i in range(len(word)-2+1):
        n_gram.append("".join(word[i:i+2]))
    n_gram_all.append(n_gram)
In [23]:
n_gram_all[0][:10]
Out[23]:
['introductionhuman',
 'humanbeings',
 'beingsare',
 'areconstantly',
 'constantlyexposed',
 'exposedto',
 'toa',
 'amyriad',
 'myriadof',
 'ofpathogens']
In [24]:
M = len(n_gram_all)
N = len(n_gram_all[1])
print("{} X {}".format(M, N))
12500 X 7228
In [25]:
# Answer 2
words_n_gram_all = []
new_words = [['the', '2019', 'novel', 'coronavirus', 'sarscov2', 'identified', 'as', 'the', 'cause']]
for word in new_words:
    # get n-grams for the instance
    n_gram = []
    for i in range(len(word)-2+1):
        n_gram.append("".join(word[i:i+2]))
    words_n_gram_all.append(n_gram)
In [26]:
words_n_gram_all[0][:]
Out[26]:
['the2019',
 '2019novel',
 'novelcoronavirus',
 'coronavirussarscov2',
 'sarscov2identified',
 'identifiedas',
 'asthe',
 'thecause']

Vectorize with HashingVectorizer¶

To vectorize the set of n-gram features constructed for every document, we will use the in-built HashVectorizer function. We will limit the feature size to 2^12 (4096) to speed up computations. We might need to increase this later to improve the accuracy:

In [27]:
from sklearn.feature_extraction.text import HashingVectorizer

# hash vectorizer instance
hvec = HashingVectorizer(lowercase=False, analyzer=lambda l:l, n_features=2**12)

# features matrix X
X = hvec.fit_transform(n_gram_all)
In [28]:
X.shape
Out[28]:
(12500, 4096)

Dimensionality Reduction with t-SNE¶

We first reduce the dimensionality of X from a 4096-dimensional space to a 2-dimensional space using a popular non-linear dimensionality reduction technique called t-SNE. In this process, t-SNE will keep similar instances together while trying to push different instances far from each other. Resulting 2-D scatter plot of t-SNE features can be useful to see which articles cluster near each other.

In [29]:
# Following cell may take 20-30 minutes to run

from sklearn.manifold import TSNE

tsne = TSNE(verbose=1, perplexity=5)
X_embedded = tsne.fit_transform(X.toarray())
C:\Users\ankit\anaconda3\lib\site-packages\sklearn\manifold\_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  warnings.warn(
C:\Users\ankit\anaconda3\lib\site-packages\sklearn\manifold\_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(
[t-SNE] Computing 16 nearest neighbors...
[t-SNE] Indexed 12500 samples in 0.075s...
[t-SNE] Computed neighbors for 12500 samples in 18.400s...
[t-SNE] Computed conditional probabilities for sample 1000 / 12500
[t-SNE] Computed conditional probabilities for sample 2000 / 12500
[t-SNE] Computed conditional probabilities for sample 3000 / 12500
[t-SNE] Computed conditional probabilities for sample 4000 / 12500
[t-SNE] Computed conditional probabilities for sample 5000 / 12500
[t-SNE] Computed conditional probabilities for sample 6000 / 12500
[t-SNE] Computed conditional probabilities for sample 7000 / 12500
[t-SNE] Computed conditional probabilities for sample 8000 / 12500
[t-SNE] Computed conditional probabilities for sample 9000 / 12500
[t-SNE] Computed conditional probabilities for sample 10000 / 12500
[t-SNE] Computed conditional probabilities for sample 11000 / 12500
[t-SNE] Computed conditional probabilities for sample 12000 / 12500
[t-SNE] Computed conditional probabilities for sample 12500 / 12500
[t-SNE] Mean sigma: 0.128966
[t-SNE] KL divergence after 250 iterations with early exaggeration: 149.291901
[t-SNE] KL divergence after 1000 iterations: 4.166944

Let's plot the result:

In [30]:
from matplotlib import pyplot as plt
import seaborn as sns

# sns settings
sns.set(rc={'figure.figsize':(15,15)})

# colors
palette = sns.color_palette("bright", 1)

# plot
sns.scatterplot(X_embedded[:,0], X_embedded[:,1], palette=palette)

plt.title("t-SNE Covid-19 Articles")
# plt.savefig("plots/t-sne_covid19.png")
plt.show()
C:\Users\ankit\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

We can clearly see few clusters forming. However, without labels, it is difficult to see the clusters. Let's try if we can use K-Means to generate our labels for these clusters. We can later use this information to produce a scatterplot with labels to verify the clusters.

Unsupervised Learning: Clustering with K-Means¶

Let us apply K-means with K = 10. For computational efficiency, we will use the Minibatch version of Kmeans.

In [31]:
from sklearn.cluster import MiniBatchKMeans

k = 10
kmeans = MiniBatchKMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)

Now that we have the labels, let's plot the t-SNE scatterplot again and see if K-means is able to capture the pattern of clusters in the data:

In [32]:
# sns settings
sns.set(rc={'figure.figsize':(15,15)})

# colors
palette = sns.color_palette("bright", len(set(y_pred)))

# plot
sns.scatterplot(X_embedded[:,0], X_embedded[:,1], hue=y_pred, legend='full', palette=palette)
plt.title("t-SNE Covid-19 Articles - Clustered")
# plt.savefig("plots/t-sne_covid19_label.png")
plt.show()
C:\Users\ankit\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

This looks pretty promising. We can see that articles from the same cluster are near each other, forming groups. There are still overlaps, though. So we will have to see if we can improve this by changing the number of clusters (K), using another clustering algorithm, or using a different feature size in the HashingVectorizer. We can also consider trying 3-grams, 4-grams, or 1-gram (plain text) instead of 2-grams to create features and vectorize them using different document vectorization methods, e.g., HashVectorizer or Tf-idfVectorizer.

In [33]:
# Answer 3
k = 18
kmeans = MiniBatchKMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)
In [34]:
# sns settings
sns.set(rc={'figure.figsize':(15,15)})

# colors
palette = sns.color_palette("bright", len(set(y_pred)))

# plot
sns.scatterplot(X_embedded[:,0], X_embedded[:,1], hue=y_pred, legend='full', palette=palette)
plt.title("t-SNE Covid-19 Articles - Clustered")
plt.savefig("t-sne_covid19_label.png")
plt.show()
C:\Users\ankit\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

Vectorize Using Tf-idf with Plain Text¶

Let's see if we will be able to get better clusters using plain text as instances rather than 2-grams and vectorize it using Tf-idf.

Vectorize¶

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=2**12)
X = vectorizer.fit_transform(df_covid['body_text'].values)
In [37]:
X.shape
Out[37]:
(12500, 4096)

KMeans with Plain text and Tf-idf¶

Again, let's try to get our cluster labels. We will choose 10 clusters again.

In [39]:
from sklearn.cluster import MiniBatchKMeans

k = 10
kmeans = MiniBatchKMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)

Get the labels:

In [40]:
y = y_pred

Dimensionality Reduction with t-SNE (Plain text and Tf-idf)¶

Let's reduce the dimensionality using t-SNE again:

In [41]:
# Following cell will take 20-30 minutes to run
from sklearn.manifold import TSNE

tsne = TSNE(verbose=1)
X_embedded = tsne.fit_transform(X.toarray())
C:\Users\ankit\anaconda3\lib\site-packages\sklearn\manifold\_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  warnings.warn(
C:\Users\ankit\anaconda3\lib\site-packages\sklearn\manifold\_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 12500 samples in 0.132s...
[t-SNE] Computed neighbors for 12500 samples in 17.368s...
[t-SNE] Computed conditional probabilities for sample 1000 / 12500
[t-SNE] Computed conditional probabilities for sample 2000 / 12500
[t-SNE] Computed conditional probabilities for sample 3000 / 12500
[t-SNE] Computed conditional probabilities for sample 4000 / 12500
[t-SNE] Computed conditional probabilities for sample 5000 / 12500
[t-SNE] Computed conditional probabilities for sample 6000 / 12500
[t-SNE] Computed conditional probabilities for sample 7000 / 12500
[t-SNE] Computed conditional probabilities for sample 8000 / 12500
[t-SNE] Computed conditional probabilities for sample 9000 / 12500
[t-SNE] Computed conditional probabilities for sample 10000 / 12500
[t-SNE] Computed conditional probabilities for sample 11000 / 12500
[t-SNE] Computed conditional probabilities for sample 12000 / 12500
[t-SNE] Computed conditional probabilities for sample 12500 / 12500
[t-SNE] Mean sigma: 0.206152
[t-SNE] KL divergence after 250 iterations with early exaggeration: 96.178764
[t-SNE] KL divergence after 1000 iterations: 2.357430

Plot t-SNE¶

In [42]:
from matplotlib import pyplot as plt
import seaborn as sns

# sns settings
sns.set(rc={'figure.figsize':(15,15)})

# colors
palette = sns.color_palette("bright", len(set(y)))

# plot
sns.scatterplot(X_embedded[:,0], X_embedded[:,1], hue=y, legend='full', palette=palette)
plt.title("t-SNE Covid-19 Articles - Clustered(K-Means) - Tf-idf with Plain Text")
# plt.savefig("plots/t-sne_covid19_label_TFID.png")
plt.show()
C:\Users\ankit\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

This time we are able to see the clusters more clearly, as the clusters are further apart from each other. We can also start to see that there is possibly more than 10 clusters we need to identify using k-means.

In [55]:
# Answer 4
# tf-idf vectorizer on the 2-gram representation of documents instead of plain text,

#vectorize 2-gram representation using tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda l:l, max_features=2**12)
X = vectorizer.fit_transform(n_gram_all)

# apply k means
from sklearn.cluster import MiniBatchKMeans

k = 10
kmeans = MiniBatchKMeans(n_clusters=k)
y1 = kmeans.fit_predict(X)

# tsne
from sklearn.manifold import TSNE

tsne = TSNE(verbose=1)
X_embedded = tsne.fit_transform(X.toarray())

# sns settings
sns.set(rc={'figure.figsize':(15,15)})

# colors
palette = sns.color_palette("bright", len(set(y1)))

# plot
sns.scatterplot(X_embedded[:,0], X_embedded[:,1], hue=y1, legend='full', palette=palette)
plt.title("t-SNE Covid-19 Articles - Clustered(K-Means) - Tf-idf with Plain Text")
# plt.savefig("plots/t-sne_covid19_label_TFID.png")
plt.show()
C:\Users\ankit\anaconda3\lib\site-packages\sklearn\manifold\_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  warnings.warn(
C:\Users\ankit\anaconda3\lib\site-packages\sklearn\manifold\_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 12500 samples in 0.045s...
[t-SNE] Computed neighbors for 12500 samples in 13.283s...
[t-SNE] Computed conditional probabilities for sample 1000 / 12500
[t-SNE] Computed conditional probabilities for sample 2000 / 12500
[t-SNE] Computed conditional probabilities for sample 3000 / 12500
[t-SNE] Computed conditional probabilities for sample 4000 / 12500
[t-SNE] Computed conditional probabilities for sample 5000 / 12500
[t-SNE] Computed conditional probabilities for sample 6000 / 12500
[t-SNE] Computed conditional probabilities for sample 7000 / 12500
[t-SNE] Computed conditional probabilities for sample 8000 / 12500
[t-SNE] Computed conditional probabilities for sample 9000 / 12500
[t-SNE] Computed conditional probabilities for sample 10000 / 12500
[t-SNE] Computed conditional probabilities for sample 11000 / 12500
[t-SNE] Computed conditional probabilities for sample 12000 / 12500
[t-SNE] Computed conditional probabilities for sample 12500 / 12500
[t-SNE] Mean sigma: 0.265925
[t-SNE] KL divergence after 250 iterations with early exaggeration: 122.639458
[t-SNE] KL divergence after 1000 iterations: 3.281467
C:\Users\ankit\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

Dimensionality Reduction with PCA (Plain text and Tf-idf)¶

t-SNE doesn't scale well. This is why the run-time of this Notebook is about 40 minutes to 1 hour with an average computer. Let's try to see if we can achieve reasonable results with PCA as it scales very well with larger datasets and dimensions:

In [44]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
pca_result = pca.fit_transform(X.toarray())

Plot PCA¶

In [45]:
# sns settings
sns.set(rc={'figure.figsize':(15,15)})

# colors
palette = sns.color_palette("bright", len(set(y)))

# plot
sns.scatterplot(pca_result[:,0], pca_result[:,1], hue=y, legend='full', palette=palette)
plt.title("PCA Covid-19 Articles - Clustered (K-Means) - Tf-idf with Plain Text")
# plt.savefig("plots/pca_covid19_label_TFID.png")
plt.show()
C:\Users\ankit\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

Sometimes it may be easier to see the results in a 3 dimensional plot. So let's try to do that:

In [46]:
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

ax = plt.figure(figsize=(16,10)).gca(projection='3d')
ax.scatter(
    xs=pca_result[:,0], 
    ys=pca_result[:,1], 
    zs=pca_result[:,2], 
    c=y, 
    cmap='tab10'
)
ax.set_xlabel('pca-one')
ax.set_ylabel('pca-two')
ax.set_zlabel('pca-three')
plt.title("PCA Covid-19 Articles (3D) - Clustered (K-Means) - Tf-idf with Plain Text")
# plt.savefig("plots/pca_covid19_label_TFID_3d.png")
plt.show()
C:\Users\ankit\AppData\Local\Temp\ipykernel_19724\3963984120.py:5: MatplotlibDeprecationWarning: Calling gca() with keyword arguments was deprecated in Matplotlib 3.4. Starting two minor releases later, gca() will take no keyword arguments. The gca() function should only be used to get the current axes, or if no axes exist, create new axes with default keyword arguments. To create a new axes with non-default arguments, use plt.axes() or plt.subplot().
  ax = plt.figure(figsize=(16,10)).gca(projection='3d')

More Clusters?¶

On our previous plot we could see that there is more clusters than only 10. Let's try to label them:

In [47]:
from sklearn.cluster import MiniBatchKMeans

k = 20
kmeans = MiniBatchKMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)
y = y_pred
In [48]:
from matplotlib import pyplot as plt
import seaborn as sns
import random 

# sns settings
sns.set(rc={'figure.figsize':(15,15)})

# let's shuffle the list so distinct colors stay next to each other
palette = sns.hls_palette(20, l=.4, s=.9)
random.shuffle(palette)

# plot
sns.scatterplot(X_embedded[:,0], X_embedded[:,1], hue=y, legend='full', palette=palette)
plt.title("t-SNE Covid-19 Articles - Clustered(K-Means) - Tf-idf with Plain Text")
# plt.savefig("plots/t-sne_covid19_20label_TFID.png")
plt.show()
C:\Users\ankit\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

It would be helpful if we have a demo tool that can be used to see what articles are identified as similar using our Clustering and Dimensionality Reduction methods, right? Let's put together a interactive scatter plot of t-SNE to do that.

Interactive t-SNE¶

In [49]:
from bokeh.models import ColumnDataSource, HoverTool, LinearColorMapper, CustomJS
from bokeh.palettes import Category20
from bokeh.transform import linear_cmap
from bokeh.io import output_file, show
from bokeh.transform import transform
from bokeh.io import output_notebook
from bokeh.plotting import figure
from bokeh.layouts import column
from bokeh.models import RadioButtonGroup
from bokeh.models import TextInput
from bokeh.layouts import gridplot
from bokeh.models import Div
from bokeh.models import Paragraph
from bokeh.layouts import column, widgetbox

output_notebook()
y_labels = y_pred

# data sources
source = ColumnDataSource(data=dict(
    x= X_embedded[:,0], 
    y= X_embedded[:,1],
    x_backup = X_embedded[:,0],
    y_backup = X_embedded[:,1],
    desc= y_labels, 
    titles= df_covid['title'],
    authors = df_covid['authors'],
    journal = df_covid['journal'],
    abstract = df_covid['abstract_summary'],
    labels = ["C-" + str(x) for x in y_labels]
    ))

# hover over information
hover = HoverTool(tooltips=[
    ("Title", "@titles{safe}"),
    ("Author(s)", "@authors"),
    ("Journal", "@journal"),
    ("Abstract", "@abstract{safe}"),
],
                 point_policy="follow_mouse")

# map colors
mapper = linear_cmap(field_name='desc', 
                     palette=Category20[20],
                     low=min(y_labels) ,high=max(y_labels))

# prepare the figure
p = figure(plot_width=800, plot_height=800, 
           tools=[hover, 'pan', 'wheel_zoom', 'box_zoom', 'reset'], 
           title="t-SNE Covid-19 Articles, Clustered(K-Means), Tf-idf with Plain Text", 
           toolbar_location="right")

# plot
p.scatter('x', 'y', size=5, 
          source=source,
          fill_color=mapper,
          line_alpha=0.3,
          line_color="black",
          legend = 'labels')

# add callback to control 
callback = CustomJS(args=dict(p=p, source=source), code="""
            
            var radio_value = cb_obj.active;
            var data = source.data; 
            
            x = data['x'];
            y = data['y'];
            
            x_backup = data['x_backup'];
            y_backup = data['y_backup'];
            
            labels = data['desc'];
            
            if (radio_value == '20') {
                for (i = 0; i < x.length; i++) {
                    x[i] = x_backup[i];
                    y[i] = y_backup[i];
                }
            }
            else {
                for (i = 0; i < x.length; i++) {
                    if(labels[i] == radio_value) {
                        x[i] = x_backup[i];
                        y[i] = y_backup[i];
                    } else {
                        x[i] = undefined;
                        y[i] = undefined;
                    }
                }
            }


        source.change.emit();
        """)

# callback for searchbar
keyword_callback = CustomJS(args=dict(p=p, source=source), code="""
            
            var text_value = cb_obj.value;
            var data = source.data; 
            
            x = data['x'];
            y = data['y'];
            
            x_backup = data['x_backup'];
            y_backup = data['y_backup'];
            
            abstract = data['abstract'];
            titles = data['titles'];
            authors = data['authors'];
            journal = data['journal'];

            for (i = 0; i < x.length; i++) {
                if(abstract[i].includes(text_value) || 
                   titles[i].includes(text_value) || 
                   authors[i].includes(text_value) || 
                   journal[i].includes(text_value)) {
                    x[i] = x_backup[i];
                    y[i] = y_backup[i];
                } else {
                    x[i] = undefined;
                    y[i] = undefined;
                }
            }
            


        source.change.emit();
        """)

# option
option = RadioButtonGroup(labels=["C-0", "C-1", "C-2",
                                  "C-3", "C-4", "C-5",
                                  "C-6", "C-7", "C-8",
                                  "C-9", "C-10", "C-11",
                                  "C-12", "C-13", "C-14",
                                  "C-15", "C-16", "C-17",
                                  "C-18", "C-19", "All"], 
                          active=20)#, callback=callback)

# search box
keyword = TextInput(title="Search:")#, callback=keyword_callback)

#header
header = Div(text="""<h1>COVID-19 Literature Cluster</h1>""")

# show
show(column(header, widgetbox(option, keyword),p))
Loading BokehJS ...
BokehDeprecationWarning: 'legend' keyword is deprecated, use explicit 'legend_label', 'legend_field', or 'legend_group' keywords instead
BokehDeprecationWarning: 'WidgetBox' is deprecated and will be removed in Bokeh 3.0, use 'bokeh.models.Column' instead

Please see the tools on right top.¶

If the text doesn't fit in the screen on the above plot when you hover, please try the 'Box Zoom' tool to zoom to the area where the target plot is. This will help the hover message to fit inside the screen.¶

Use the 'Reset' button to revert the zoom.¶

This notebook is an adaption from the following kaggle notebook https://www.kaggle.com/maksimeren/covid-19-literature-clustering

You can find the full version of the interactive plot here on GitHub:¶

https://maksimekin.github.io/COVID19-Literature-Clustering/plots/t-sne_covid-19_interactive.html¶

In [ ]: